(57) Abstract

A Quasi Radix-16 Butterfly comprises a radix-4 butterfly
processor and on-board memory with external memory addressing
changes from a conventional radix-4 butterfly processor.
On-chip cache memory is included to store data outputs of the
radix-4 butterfly processor for application as data inputs to
the radix-4 butterfly processor in a second series of
butterfly operations to implement high-speed processing that
is maximally execution-bound.
DESCRIPTION

Background of the Invention

1. Field of the Invention

The present invention relates to a VLSI butterfly
processor and method for high performance Fast Fourier 
Transform (FFT) processors. 

2. Description of the Related Art

With the present state of the art in VLSI
technology, the speed of VLSI computing structures is
typically limited by the data bandwidth of a VLSI array
processor and not by silicon circuits. The data bandwidth is
directly related to the number of input and output (I/O)
pins. The level of integration has reached a point where,
due to limited I/O bandwidth, it is not possible to achieve
100% execution hardware utilization without increasing the
number of I/O pins significantly. In such a situation, the
design of VLSI processors should aim for minimization of the
undesirable effects of limited I/O bandwidth while preserving
the tremendous advantages of large computation arrays.

Early integrated FFT processors were based on a
radix-2 "butterfly" as illustrated in Figure 1a, to keep the
amount of hardware required on a single chip to a minimum. A
"butterfly" is a set of arithmetic operations commonly used
in digital signal processing. Recently, due to advancing
technology, several radix-4 butterfly-based processors have
been disclosed. These processors are a logical evolutionary
advance from radix-2 processors that takes advantage of a
higher level of integration and thus more silicon area. Each
radix-4 butterfly requires four data inputs, three "twiddle"
inputs (i.e., angular velocity information), and four data
outputs, as shown in Figure 1b. A typical multi-cycle
radix-4 butterfly processor includes one complex data input
port, one complex twiddle port and one complex data output
port, where each port can be 32 to 48 bits wide in such fixed
point processors. Thus, for a radix-4 butterfly processor
there is already a large number of I/O pins. Also, it is
difficult to conceive of a radix-8 butterfly processor
because there is a tremendous increase in the amount of
hardware required for a radix-8 butterfly processor as
compared with a radix-4 butterfly processor. The term
"butterfly" processor is used herein to conveniently
designate the circuit module, described in detail later
herein, which includes a plurality of fan-in inputs and a
plurality of fan-out outputs in the schematic representation.

It is desirable to compute Fast Fourier transforms
(FFTs) using as high a radix as possible to alleviate the I/O
bandwidth problem in VLSI FFT processors. The use of higher
radices is even more desirable for FFT processors designed to
handle exceptionally large FFT sizes, e.g., a 15 million
data-point FFT. The I/O bandwidth problem in high
performance FFT processors is illustrated in the timing
diagrams of Figures 2a and 2b. Let Tio be the I/O cycle
time and Teu be the execution unit pipeline cycle time. For
a dual radix-2 butterfly, four I/O cycles are performed to
fetch the data (A1, B1, A2, B2), as shown in Figure 2a. The
datapath then performs two butterfly operations (1, 2) on
fetched data in two cycles. If Tio is equal to Teu then the
datapath cannot be kept busy continuously and is I/O bound as
indicated by the blank cycles in the datapath timing waveform
of Figure 2a. As indicated in Figure 2a, to make the
datapath execution bound, 4xTio = 2xTeu, i.e., the I/O cycle
time would have to be one-half the execution pipeline cycle
time. Figure 2b illustrates the same situation for a radix-4
butterfly. Here again four I/O cycles are required to fetch
the data for a single radix-4 butterfly. If it is assumed
that the datapath can execute a radix-4 butterfly every two
cycles, then once again Tio would have to be (1/2) Teu to
make the datapath execution bound. This then is the I/O
bandwidth problem in high performance pipelined FFT
processors with a limited number of I/O ports, specifically
three complex ports (namely, an input port, a twiddle port
and an output port as indicated in Figure 2).
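
The I/O-bound condition described above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only; the function name `datapath_utilization` and its parameters are assumptions introduced here, and the model simply compares fetch time against execution time for one block of work.

```python
def datapath_utilization(io_cycles, exec_cycles, t_io=1.0, t_eu=1.0):
    """Fraction of time the datapath is busy for one block of work.

    io_cycles:   I/O cycles needed to fetch the data for the block
    exec_cycles: pipeline cycles the datapath spends on the block
    t_io, t_eu:  I/O and execution-unit cycle times (same time unit)
    """
    io_time = io_cycles * t_io
    exec_time = exec_cycles * t_eu
    # The datapath cannot consume data faster than it arrives.
    return min(1.0, exec_time / io_time)


# Dual radix-2 and single radix-4 cases of Figures 2a and 2b:
# four fetch cycles feed two execution cycles.
io_bound = datapath_utilization(io_cycles=4, exec_cycles=2)

# Halving the I/O cycle time (4 x Tio = 2 x Teu) balances the two.
balanced = datapath_utilization(io_cycles=4, exec_cycles=2, t_io=0.5, t_eu=1.0)
```

With Tio equal to Teu the datapath is idle half the time, which corresponds to the blank cycles in the timing waveforms.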

Summary of the Invention
In accordance with the illustrated embodiment of
the present invention, radix-16 butterfly hardware is
implemented using a multi-cycle radix-4 butterfly-based
circuit module that uses relatively fewer I/O pins and yet
relieves the I/O bandwidth problem to achieve significantly
higher throughput compared with a conventional radix-4
butterfly implementation. This invention, henceforth
referred to as the Quasi Radix-16 (QR16) butterfly processor,
includes a radix-4 butterfly processor and on-board memory
with some minor external memory addressing changes from a
conventional radix-4 butterfly processor. The QR16 performs
a quasi radix-16 butterfly in two columns of four radix-4
butterfly operations as illustrated in Figure 1c. On-chip
cache memory is included to store data outputs of the first
column of radix-4 butterflies for use as data inputs in the
radix-4 butterfly operations of the second column to
implement high-speed processing that is maximally
execution-bound, as illustrated in Figure 2c.

Figure 1a is a simplified diagram of a conventional
radix-2 butterfly operation;

Figure 1b is a simplified diagram of a conventional
radix-4 butterfly operation;

Figure 1c is a simplified diagram of one embodiment
of a quasi radix-16 butterfly operation according to the
present invention;

Figures 2a-c are timing diagrams for the associated
radix-n butterfly processors that illustrate I/O
bandwidth and execution limitations;

Figure 3a is a simplified schematic diagram of a
preferred embodiment of the quasi radix-16 butterfly hardware
implementation according to the present invention;

Figure 3b is a simplified chart illustrating data
flow in a quasi radix-16 butterfly processor according to the
present invention;

Figures 4a and 4b are pictorial representations of
the processing sequences for a conventional radix-4 processor
and the quasi radix-16 processor of the present invention;

Figure 5 is a timing diagram illustrating the
impact of data dependencies in a datapath on the processing
speed attributable to the quasi radix-16 processor of the
present invention;

Figure 6 is a chart illustrating the data
processing for maximum execution utilization of a quasi
radix-16 processor according to the present invention;

Figure 7a is a simplified diagram illustrating the
data flow through an interleaved quasi radix-16 scheme as
illustrated in Figure 7b, according to the present invention;

Figure 7b is a simplified schematic diagram
illustrating an interleaved quasi radix-16 butterfly
operation with multiple twiddle inputs for operation as
illustrated in Figure 4b according to the present invention;

Figure 7c is a chart illustrating the address
orientations of twiddle inputs within on-chip memory in
hexadecimal notation;

Figure 8 is a simplified diagram for performing an
N-point Fast Fourier Transform with a quasi radix-16
butterfly of the present invention illustrating the generic
notation;

Figure 9 is a block schematic diagram of the
preferred embodiment of the circuit module radix-4 processor
for operation in the quasi radix-16 processor according to
the present invention;

Figure 10 is a schematic diagram of the memory and
addressing circuitry associated with the quasi radix-16
processor;

Figures 11a and 11b are charts illustrating the
input memory storage scheme for data inputs;

Figures 12a-c are charts illustrating the
coefficient memory storage scheme for twiddle input data;

Figures 13a and 13b are charts illustrating the
cache memory storage scheme; and

Figures 14a-c are charts illustrating the output
data memory scheme.






Description of the Preferred Embodiment

In general, a Fourier transform of data points
representing a complex function includes an ascending series
of terms, each including a coefficient and an angular
velocity (or frequency) factor, where each term of the
complex function may include real and imaginary quantities.

$D_0 + D_1 \times W_1 + D_2 \times W_2 + D_3 \times W_3$   (Eq. 1)

Thus, the higher-order terms of the complex function, in the
general case, require four multiplications of the real and
imaginary components per term, where $D_n$ is data and $W_n$ is a
"twiddle" input such that

$D_n = a + jb$   (Eq. 2)

$W_n = c + jd$   (Eq. 3)

$D_n \times W_n = (a + jb)(c + jd) = a \times c + a \times jd + c \times jb + jb \times jd$   (Eq. 4)

Radix-4 processing performs four multiplications per term, or
twelve multiplications for these higher-order terms $D_1 \times W_1$
through $D_3 \times W_3$. The data inputs $D_n$ and the twiddle inputs
$W_n$ are supplied at separate inputs for processing in selected
sequence, as later described herein.
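
The count of four real multiplications per complex term can be made concrete. The following Python sketch expands the product of Eq. 4 into its real and imaginary parts; the function name `complex_multiply` and the constant names are assumptions introduced for illustration.

```python
def complex_multiply(a, b, c, d):
    """Return (real, imag) of (a + jb)(c + jd) using four real multiplications."""
    real = a * c - b * d   # the jb x jd term contributes -b*d because j*j = -1
    imag = a * d + c * b   # the two cross terms a x jd and c x jb
    return real, imag


MULTS_PER_TERM = 4                      # a*c, b*d, a*d, c*b
RADIX4_MULTS = MULTS_PER_TERM * 3       # terms D1 x W1, D2 x W2, D3 x W3
```

For example, `complex_multiply(1, 2, 3, 4)` agrees with Python's built-in `complex(1, 2) * complex(3, 4)`.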

Referring now to Figure 1c, there is shown the
quasi radix-16 butterfly which includes two columns of four
radix-4 butterflies each (A to D and E to H). The QR16
processor uses on-chip cache memory, later described herein,
to store the 16 data outputs of the first column butterfly
operations, for use in the second column butterfly
operations. Figures 3a, 9 and 10 illustrate the hardware
implementation of the QR16 processor around a radix-4
butterfly execution unit. As can be seen from these
figures, the QR16 processor is implemented by adding a
limited amount of on-board memory 11, 13, 15 to a radix-4
butterfly processor chip.
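
The two-column structure can be sketched in software. The following Python example is a minimal sketch, not the hardware of Figure 1c: it assumes a standard decimation-in-time index mapping, uses a plain list as a stand-in for the cache memory, and checks the result against a direct 16-point DFT. All function names are hypothetical, and the twiddle values are those of a standalone 16-point transform (the first column twiddles are then all 1).

```python
import cmath

# Multiplication by (-j)**m: the fixed rotations inside a 4-point DFT.
ROT = [1, -1j, -1, 1j]


def radix4_butterfly(d, w):
    """Scale d[1..3] by twiddles w[0..2] (W1..W3), then take a 4-point DFT."""
    t = [d[0], d[1] * w[0], d[2] * w[1], d[3] * w[2]]
    return [sum(t[n] * ROT[(n * k) % 4] for n in range(4)) for k in range(4)]


def qr16_butterfly(x, col1_tw, col2_tw):
    """16-point transform as two columns of four radix-4 butterflies."""
    cache = [0j] * 16                      # stand-in for the on-chip cache
    # Column 1 (butterflies A..D): three twiddles shared by all four.
    for b in range(4):
        out = radix4_butterfly([x[4 * n + b] for n in range(4)], col1_tw)
        for k in range(4):
            cache[4 * k + b] = out[k]      # digit-reversed write
    # Column 2 (butterflies E..H): sequential cache reads, 12 twiddles.
    result = [0j] * 16
    for k1 in range(4):
        out = radix4_butterfly(cache[4 * k1:4 * k1 + 4], col2_tw[k1])
        for k2 in range(4):
            result[k1 + 4 * k2] = out[k2]
    return result


def standalone_twiddles():
    """Twiddles for an isolated 16-point transform (an assumption of this sketch)."""
    def w16(e):
        return cmath.exp(-2j * cmath.pi * e / 16)
    col1 = [1, 1, 1]
    col2 = [[w16(n2 * k1) for n2 in (1, 2, 3)] for k1 in range(4)]
    return col1, col2


def direct_dft(x):
    """Reference O(N^2) DFT used only to check the sketch."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]
```

Only the 16 inputs and 16 outputs cross the function boundary; the 16 intermediate values stay in `cache`, which is the role the text assigns to the on-chip cache memory.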




Each QR16 butterfly requires 16 input data points
(D0-D15) and 15 twiddle factors (W1-W15) as shown in Figure
1c, which takes 16 fetch cycles with 2 complex data input
ports, i.e., input data and twiddle inputs. Internally, eight
radix-4 butterfly operations (A to H in Figure 1c) are
required. The first column uses the same data ordering as a
16-point radix-4 fast Fourier transform (FFT) and all
butterflies (A to D) have three common twiddles, i.e., W1, W2,
and W3. The second column of butterflies E through H requires
digit reversed data ordering and 12 unique twiddles, i.e., W4
to W15. Without the QR16 butterfly, a 16 point radix-4 FFT
would require 32 (2x16) data inputs, 24 (2x12) twiddle
inputs, and 32 (2x16) data outputs, since the first column
results would have to be stored in off-chip memory and be
brought in again on-chip to complete the second column.
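
The off-chip traffic figures quoted above can be tabulated directly. The following Python sketch only restates the counts given in the text; the function names and dictionary keys are assumptions introduced for illustration.

```python
def radix4_only_transfers(points=16, columns=2, twiddles_per_column=12):
    """Off-chip transfers for a 16-point FFT done as two separate radix-4 columns."""
    return {
        "data_in": columns * points,                  # 2 x 16
        "twiddle_in": columns * twiddles_per_column,  # 2 x 12
        "data_out": columns * points,                 # 2 x 16
    }


def qr16_transfers(points=16, twiddles=15):
    """Off-chip transfers for one QR16 butterfly (intermediate data stays on chip)."""
    return {"data_in": points, "twiddle_in": twiddles, "data_out": points}


conventional_total = sum(radix4_only_transfers().values())
qr16_total = sum(qr16_transfers().values())
```

Under these counts the QR16 scheme moves 47 words off and on chip per 16-point block, against 88 for two independent radix-4 columns.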

With the QR16 butterfly scheme of the present
invention, it is possible to achieve as much as twice the
throughput of a conventional radix-4 butterfly based
processor. The higher throughput is achieved by maximizing
the utilization of the datapath, as shown in Figure 2c. It
should be noted in Figure 2c that each data fetch and output
store, e.g., A1- and A1** respectively, actually takes four
input/output (I/O) cycles. The 16 data points (A1-, B1-, C1-
and D1-) brought on-chip are passed twice through the
datapath (Pass 1-1 and Pass 1-2) before being stored in
external memory. Thus, for the QR16 processor of the present
invention, 16xTio = 16xTeu, or Tio is equal to Teu, and the
datapath is continuously busy, thus indicating an ideal I/O
balanced environment.

WO 92/18940 PCT/JP92/00494

Referring now to Figure 4, the chart illustrates
the difference between a radix-4 butterfly sequence and a
QR16 butterfly sequence for 1K data points of a fast Fourier
transform. Figure 4a shows the sequence of butterfly
operations for a 1K point FFT using only radix-4
butterflies. Each shaded rectangle in Figure 4 represents
one radix-4 butterfly operation. All the butterflies of a
single column are processed before the next column of
butterflies are processed. Thus, a 1K point FFT can be
performed in five columns by five radix-4 passes (1024 =
4x4x4x4x4) as shown in Figure 4a. In contrast, a 1K point
FFT can be performed in three columns by two QR16 butterfly
passes and one radix-4 pass (1024 = 16x16x4) as shown in
Figure 4b. In Figure 4b, the QR16 butterfly is shown as a
dashed line enclosing 8 radix-4 butterflies arranged in 2
columns. In Figure 4b, processing also proceeds column by
column except only three columns are needed because of the
higher radix used in the first two columns. The QR16
butterfly essentially allows two radix-4 columns to be
processed at a time before the next stage. The QR16
processor advantageously includes on chip cache memory 15 for
storing real and imaginary data needed for processing the two
radix-4 columns in the QR16 butterfly. The present
invention advantageously uses on chip memory to rearrange
radix-4 butterflies to exploit data locality to increase the
speed of computation. In addition to cache memory, input
and output memories are included, as shown in Figure 3a. The
QR16 embodiment of the present invention includes an inherent
"pipeline" latency problem, when implemented on a pipelined
radix-4 datapath. In accordance with this invention, this
latency can be eliminated by interleaving the QR16
computations of two 16-point data blocks in the pipelined
datapath, as later described herein. Table 1, below,
illustrates the increases in processing speed that are
achieved using the QR16 of the present invention, especially
when operated in radix-16 mode only.

Table 1

                           FFT Execution Time       Speed Improvement
FFT Size   Passes          QR16       Only Radix-4  Factor
64         4x16            2.56us     3.84us        1.5
256        16x16           10.24us    20.48us       2.0
1K         4x16x16         61.44us    102.4us       1.67
4K         16x16x16        245.76us   491.52us      2.0
16K        4x16x16x16      1.31ms     2.29ms        1.75
64K        16x16x16x16     5.24ms     10.48ms       2.0

Assuming Tio = Teu = 20ns
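
The entries of Table 1 are consistent with a simple model in which every pass over an N-point data set, whether a QR16 pass or a radix-4 pass, takes N cycles of 20 ns. The following Python sketch reproduces the table under that assumption; the model is inferred from the tabulated numbers rather than stated in the text, and the function names are hypothetical.

```python
def radix4_columns(n):
    """Number of radix-4 columns m such that 4**m == n."""
    m = 0
    while n > 1:
        if n % 4:
            raise ValueError("FFT size must be a power of 4")
        n //= 4
        m += 1
    return m


def passes(n, use_qr16):
    """Passes over the data: QR16 handles two radix-4 columns per pass."""
    m = radix4_columns(n)
    return (m + 1) // 2 if use_qr16 else m


def execution_time_us(n, use_qr16, cycle_ns=20):
    """Execution time in microseconds, assuming N cycles per pass."""
    return passes(n, use_qr16) * n * cycle_ns / 1000.0


def speed_improvement(n):
    return execution_time_us(n, False) / execution_time_us(n, True)
```

For a 1K-point FFT this gives three passes against five, which is the 16x16x4 versus 4x4x4x4x4 decomposition described in connection with Figure 4.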



The Quasi-Radix 16 Butterfly Implementation

Referring now to Figures 3a and 10, in order to
implement QR16 using a radix-4 module 7, a limited amount of
on-board memory has to be included with this module on the
same circuit chip. Memory structures 9, 11, 13 are placed at
the input and output of the datapath. In addition, a cache
memory 15 is also required. For complex data, the memory
word width should allow for real and imaginary data
quantities. The input and output memories 11, 13 are used
for data buffering while the cache memory 15 is used to
store the intermediate results of column 1 of the QR16
butterfly processing, as illustrated in Figure 1c. Figure 3b
shows the basic data flow through the memories and the
datapath. A more detailed version will be discussed later
herein. Input data points are stored in the input memory 11
and the twiddles are stored in the coefficient memory 9.
Then, the datapath operates on this data to accomplish the
first column of QR16 butterfly processing. The results of
the first column butterflies are stored in the cache memory
15 and once all the butterflies of column 1 are complete, the
cache memory 15 is read to perform the second column
butterflies. The final results of the second column of a
QR16 butterfly are stored in the output memory 13 for
transfer to off-chip memory.

Typically, multiple stage pipelined data paths are
used to implement high performance FFT processors, e.g., a
pipeline stage for the multiplier section, a pipeline stage
for the adder section, a pipeline stage for the ALU section,
and the like. In the QR16 butterfly of the present
invention, there exists a data dependency between the first
column and the second column of butterfly processing. The
results of all first column butterflies must be available
before the second column calculations can begin, i.e., with
reference to Figure 1c, the results of butterfly D must be
stored in the cache memory 15 prior to the execution of
butterfly E. This data dependency becomes an acute problem
in pipelined data paths where the results of a particular
operation are not available until a specified number of clock
cycles later due to the latency of the datapath pipeline. In
the QR16, this would impose a certain number of dead (or
no-op) cycles into the pipeline after the last butterfly of
the first column (butterfly D) and before the first butterfly
of the second column (butterfly E) to avoid this data
dependency.

This data dependency can be understood clearly from
the timing diagram of Figure 5 which shows nine timing
patterns corresponding to the memory read and write functions
and the execution unit, where each of these waveforms is
defined below:

(a) Input Memory Write - Input data from I/O pins
written to Input Data Memory 11;

(b) Coeff Memory Write - Twiddle data from I/O pins
written to Coefficient Memory 9;

(c) Coeff Memory Read - Twiddle data transfer to
Execution Unit 7;

(d) Input Memory Read - Input data transfer to
Execution Unit 7;

(e) Execute - Execution Unit activity;

(f) Cache Write - Execution unit result data
written to Cache Memory 15;

(g) Cache Read - Cache data transfer to Execution
Unit 7;

(h) Output Memory Write - Execution unit result
data written to Output Data Memory 13; and

(i) Output Memory Read - Output data read to I/O
pins.



Referring now to the graph of Figure 5, A and B
refer to two different QR16 butterfly data blocks. A1 and A2
refer to the first and second columns of the A block
respectively. Similarly, B1 and B2 refer to the first and
second columns of the B block, as illustrated in Figure 7b.
In the timing waveforms of Figure 5, the shaded areas on a
waveform indicate no activity for that function for the
specified duration, whereas unshaded areas indicate
processing activity.

The timing diagram in Figure 5 assumes Teu = Tio, 
i.e., pipeline cycle time is equal to I/O cycle time. First, 
the input data for block A is written to the input memory 11; 
this takes sixteen (Tio or Teu) cycles to bring in the 
sixteen data points of block A and is shown on the Input 
Memory write pattern (a). Similarly, the twiddles for block 
A are also written to the coefficient memory 9 in sixteen 
cycles. Note that the Coeff Memory Write (b) is delayed 16 
by eight cycles with respect to the Input Memory Write (a). 
This situation arises because the Coeff Memory Read A1 only
empties three memory locations. As will be shown later, the
twiddle sequencing requires retransmission of three twiddles
for the first column butterflies. Therefore, to keep the
input 11 and coefficient 9 memories of the same depth, the
Coeff Memory Write (b) is delayed 16 by eight cycles with
respect to the Input Memory Write (a). Two cycles prior to
the completion of Input Memory Write (a), the execution unit 
7 accesses the input data memory 11 (Input Memory Read (d) ) 
and the coefficient memory 9 (Coeff Memory Read (c) ) to get 
the data for the first column butterflies of block A. Note 
that the read process only takes eight cycles 16 since the 
read bandwidth is twice the write bandwidth for these 
memories 9, 11, as later described. Therefore, two cycles 
are required to fetch the data for one radix-4 butterfly from 
the memories 9, 11. Hence, the execution unit activity can
begin two cycles after the Coeff (c) and Input (d) Memory
Read, as shown on the Execute waveform (e). Each column of a
QR16 butterfly requires that eight cycles be initiated since 
there are four butterflies in each column. Note that the 
transitions shown on the Execute waveform (e) only indicate 
the beginning of a new column of butterflies and not the
total execution time of a column through the pipeline. The
radix-4 datapath on which the QR16 of the present invention 
is implemented has an execution latency of six cycles 18. 
Therefore, the results of A1 start getting written into the
cache memory six cycles 18 after the execution (A1) is
begun. The Cache Read (A1/A2) (g) is initiated as soon as
all the results of A1 are available. The twiddles for column
2 are accessed at the same time (Coeff Memory Read A2) (c)
from the Coefficient Memory 9. Execution for column 2, i.e.,
A2, can be started two cycles later. Six cycles 20 after the
start of A2 execution, the results of these computations can
begin to be written (h) into the output memory 13, and the
data in the output memory 13 can be read (i) to the I/O pins
(Output Memory Read). For the output memory 13, the write
bandwidth is twice the read bandwidth, as described later
herein, and thus the results of a column of butterflies are 
written in eight cycles and read in sixteen cycles. The 
sequence of operations described above is repeated for block 
B and every other block thereafter. It should be noted, 
however, that the shaded areas in the Execute timing waveform 
(e) indicate dead time periods 22 attributable to this data 
dependency in the processing through the QR16. Specifically, 
for this datapath, a dead time period 24 consisting of seven
Teu (or Tio) cycles would be introduced into the pipeline
between the A1 and A2 execution.

However, in accordance with the present invention 
this latency can be eliminated by interleaving the QR16 
computations of two 16-point blocks in the datapath. To 
accomplish this, twice as large on-board memories are used to 
store two 16-point data blocks on-chip at the same time, as 
illustrated in the timing diagram of Figure 6 which shows the 
timing diagram for two interleaved QR16 butterfly 
computations. The timing waveforms shown in Figure 6 have
the same definitions as in Figure 5, and A and B again refer
to two QR16 butterfly data blocks. The input data for the
two QR16 butterfly data blocks (A and B) is written (a) to
the input data memory 11 sequentially, and this takes 32
(2x16) cycles. However, twiddle data for both block A and B
is written (b) to the coefficient memory 9 in a scrambled
fashion, which will be explained later, and is again delayed
26 by sixteen cycles with respect to the Input Memory Write
(a). Once the input data for a pair of QR16 blocks is
available in the input data memory 11, the execution unit 7
reads the input data memory 11 to start the first column
block A computations (e) (A1). As soon as the data for the
last butterfly of A1 has been read, the first column of block
B computations are initiated. The sequence of computations
just described for interleaved QR16 butterflies is
illustrated in Figure 7a, and a more detailed version of the
computation is illustrated in Figure 7b. It should be noted
in Figure 6 that as soon as B1 computations are complete, the
results of A1 are already available for fetching from the
cache memory 15 to compute A2. As this computation is
completed, the results of B1 are already available in the
cache memory 15 for computing B2. Finally, this process can
be repeated for the next pair of QR16 data blocks, and so
on. This interleaving scheme in accordance with the present
invention requires 32 storage locations in all the on-board
memories 9, 11, 13 and 15. As can be seen from the Execute
timing waveform (e) in Figure 6, there are now no no-op or
shaded areas 28 on this waveform, thus indicating a
continuously busy datapath. Figure 7c shows the twiddle or
coefficient sequence for this interleaved scheme, as
explained in detail later herein in connection with
the addressing for the coefficient memory 9.
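
The effect of interleaving on datapath occupancy can be estimated with a simplified cycle count. The following Python sketch is an assumption-laden model, not a reproduction of Figures 5 and 6: it takes the eight issue cycles per column and the seven dead cycles between A1 and A2 from the text, and assumes that the only stall is the one between the two columns of a block. The names are hypothetical.

```python
COLUMN_ISSUE_CYCLES = 8   # four radix-4 butterflies, one issued every two cycles
DEAD_CYCLES = 7           # gap between A1 and A2 quoted for the Figure 5 datapath


def issue_order(interleaved):
    """Order in which butterfly columns enter the datapath for blocks A and B."""
    return ["A1", "B1", "A2", "B2"] if interleaved else ["A1", "A2", "B1", "B2"]


def cycles_per_block(interleaved):
    """Datapath cycles consumed per 16-point block under this simplified model."""
    # With interleaving, column 1 of block B fills the gap after A1.
    filler = COLUMN_ISSUE_CYCLES if interleaved else 0
    stall = max(0, DEAD_CYCLES - filler)
    return 2 * COLUMN_ISSUE_CYCLES + stall


def utilization(interleaved):
    """Fraction of datapath cycles doing useful work."""
    return 2 * COLUMN_ISSUE_CYCLES / cycles_per_block(interleaved)
```

In this model the eight cycles spent issuing B1 exceed the seven-cycle dead period, so the stall disappears and the datapath stays continuously busy, as the Execute waveform of Figure 6 is described as showing.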

Having defined the basic memory requirements for
QR16 and having discussed the interleaved QR16 scheme, the
details of each of the on-board memories will now be
described. The cache memory 15 is a conventional
double-buffered structure or dual-port memory to allow
concurrent read/write for the pipelined datapath. As
indicated in Figure 6, to keep the datapath pipeline
continuously busy with the interleaving scheme, the cache
memory 15 must be written with the results of B1 and at the
same time the results of A1 need to be read. The cache
memory 15 is preferably comprised of four banks 24 bits wide
and 16 words deep. Therefore, the cache memory 15 may have
32 complex data word (48 bits) locations or 64 real data
word (24 bits) locations. The word width is governed
directly by the data in the datapath on which the QR16 is
implemented. The cache memory 15 must be able to supply and
accept two complex words every clock cycle. The cache memory
addressing can be accomplished, for example, by a four-bit
counter and a few multiplexers (mux). Some conventional
counter and mux logic supports sequential data ordering as
well as conventional digit-reversed data ordering. As
mentioned above, the results of the first column of a QR16
butterfly are stored in digit-reversed order and read in
sequential order for the second column calculations.
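
Digit-reversed addressing for a 16-entry block amounts to swapping the two base-4 digits of a four-bit address. The following Python sketch shows one way to express that mapping; it is an illustration of the ordering only, not of the counter and multiplexer circuit, and the function name is an assumption.

```python
def digit_reverse_4bit(addr):
    """Swap the two base-4 digits of a 4-bit address (0..15)."""
    hi = addr >> 2        # upper base-4 digit
    lo = addr & 0b11      # lower base-4 digit
    return (lo << 2) | hi


# Cache write addresses produced as a 4-bit counter runs from 0 to 15.
write_order = [digit_reverse_4bit(i) for i in range(16)]
```

Applying the mapping twice returns the original address, so the same logic can serve either a digit-reversed write or a digit-reversed read.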
Input data memory 11 and output data memory 13,
placed at the input and output of the datapath, respectively,
are of conventional design and used for data buffering and
concurrent read/write operation. Both the input and output
memories 11, 13 are also four banks 24 bits wide and 16 words
deep, having 32 complex data word (48 bits) locations or 64
real data word (24 bits) locations. The memory
structures are used here for the reasons that there is a
speed gap between slow I/O and a fast datapath, and that
there is a need for convenient simultaneous read/write
operations. On the proposed datapath, a radix-4 butterfly
can be executed every two pipeline cycles. This implies that
two data quantities must be supplied every pipeline cycle (2
data per cycle x 2 cycles = 4 data values per radix-4
butterfly) if the data and twiddle have separate dedicated
I/O ports. However, it is generally not possible to supply
two data values every cycle with just one data and one
twiddle port if the I/O cycle time is the same as the
pipeline cycle time. This is the speed gap that exists which
is eliminated by using memory structures 9, 11, 13 at the
input and output of the datapath. The timing shown in Figure
6 assumes one common clock between I/O and execution unit in
order to minimize the complexity of the control structure.
With a common clock, it becomes necessary to have the read
bandwidth of the input memory 11 be twice the write bandwidth
and vice versa for the output memory 13. The timing shown
therefore avoids boundary conditions in the memories for ease
of implementation. Even if different clocks are used for
the I/O and the datapath, the use of memory structures at the
input and output eases the interface problem between external
circuitry and the internal datapath.
In addition to the input and output memories 11,
13, there is also a coefficient memory 9 which is used for
storing the twiddle data, as shown in Figure 3a. The
coefficient memory 9 is preferably comprised of four banks 24
bits wide and 20 words deep. Thus, the coefficient memory 9
may have 40 complex data word (48 bits) locations or 80 real
data word (24 bits) locations. However, only 30 locations
are used to store two sets of 15 twiddles corresponding to
the pair of QR16 data blocks resident in the input memory 11
for the interleaving scheme. The coefficient memory 9
structure is otherwise similar to the input and output
memories 11, 13 previously described. The coefficient memory
9 also has a read bandwidth twice that of the write
bandwidth, and is of conventional configuration for
supporting non-sequential read addressing. It should be
noted that three twiddle factors, W1, W2, and W3, are shared
across all radix-4 butterflies in the first column of data
block A, as shown in Figure 7b. For the second column
butterflies of data block A, 12 unique twiddles, W4 to W15,
are read from the coefficient memory 9 sequentially. In the
interleaving scheme according to the present invention as
illustrated in Figure 7c, W1 to W3 are read 4 times for the
first column calculation of data block A (A1) and then W16 to
W18 are read 4 times for the first column calculation of
data block B (B1). Then W4 to W15 are read sequentially for
second column calculations of data block A (A2) and W19 to
W30 are read sequentially for second column calculations of
data block B (B2). The required read sequence for
the twiddles in the interleaving scheme of the present
invention is shown more clearly in Figure 7c which indicates
the address locations of the twiddle inputs (numbered in
hexadecimal notation). The coefficient memory 9 only
supports read-retransmission from the required locations, and
the combined sequence between the first 15 and the second 15
twiddles is supported by an external address generator 21, as
illustrated in Figure 10.
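
The twiddle read order for one interleaved pair of blocks can be listed explicitly. The following Python sketch generates the sequence of twiddle indices (W1 to W30) in the order described above; it does not reproduce the hexadecimal address layout of Figure 7c, and the function name is an assumption.

```python
def interleaved_twiddle_reads():
    """Twiddle indices in read order for one interleaved pair of QR16 blocks."""
    a1 = [1, 2, 3] * 4            # W1..W3 re-read for butterflies A..D of block A
    b1 = [16, 17, 18] * 4         # W16..W18 re-read for column 1 of block B
    a2 = list(range(4, 16))       # W4..W15 read once each for column 2 of block A
    b2 = list(range(19, 31))      # W19..W30 read once each for column 2 of block B
    return a1 + b1 + a2 + b2


reads = interleaved_twiddle_reads()
```

The sequence contains 48 reads but touches only 30 distinct twiddles, which is why only 30 of the 40 complex locations in the coefficient memory are used.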

In the QR16 according to the present invention, two
levels of address generation are required. First, a given
size FFT must be composed of QR16 blocks and the appropriate
number of passes of radix-16, radix-4 and radix-2 must be
calculated. The equations for data and twiddle sequencing to
perform an N point FFT by QR16 are shown in Figure 8. The
data and twiddle addressing at this block level, where a pair
of 16-point data blocks are brought from external memory
(MEM) 32, 34 and 36, is done by an off-chip conventional
address generator (AG) 21, 23, 25, as shown in Figure 10. As
previously described, the external AG 21, 23, 25 must combine
the twiddles of two QR16 blocks in the manner shown in Figure
7c. At the second level, the address generation is required
within the pair of data blocks and this, as previously
discussed, is handled on-chip by conventional addressing
circuitry associated with each on-board memory 9, 11, 13 and
15.

Referring now to Figure 9, there is shown a
schematic diagram of processor circuitry according to the
present invention for operation in the quasi-radix-16
embodiment. This circuitry and the datapath therethrough can
accomplish a radix-4 complex butterfly every two cycles using
6 multipliers 27, 29, 31, 33, 35 and 37, and with 3 stages of
adders 39-42, 44-47 and 49-52, to permit a 16-cycle QR16
butterfly processing of data and twiddles applied to input
registers 53-66. The data bandwidth required by this
datapath for the QR16 butterfly is achieved by doubling the
read/write bandwidth of the on-board memories. Thus, on each
execution cycle, the on-board memories can supply/accept two
sets of two complex numbers. Thus, with reference to Eq.
1, above, the data for processing successively higher-order
terms is arranged generally from left to right in that the D0
data inputs are supplied via data registers 53, 54, the D1
data inputs are supplied via data registers 55, 56, the D2
data inputs are supplied via data registers 57, 58 and the D3
data inputs are supplied via data registers 59, 60.
Similarly, the W1 twiddle inputs are supplied via data
registers 61, 62, the W2 twiddle inputs are supplied via data
registers 63, 64, and the W3 twiddle inputs are supplied via
data registers 65, 66. The groups of multiplexers 67, 68 and
69, 70 and 71, 72 are cross connected to facilitate the
complex cross-product multiplication in the respective
multipliers during alternate cycles to yield products that
are stored in registers 74-81. The initial products are
arithmetically combined in adder/subtractor units 39-42 and
stored in the registers 82-85. The outputs of registers
82-85 are again combined in a second stage of adder/subtractor
units 44-47 and stored in data registers 87-94. The data in
registers 87-94 is supplied via multiplexers 95-95 to the
arithmetic logic units 49-52. Registers 98-102 are coupled
to receive and store the results from the arithmetic logic
units 49-52. The outputs of these registers 98-102 are
coupled to a data formatter multiplexer 104 that organizes
the data from the registers for output to the memories in the
desired 24 or 48 bit format. Thus, the complex data for D1W1
is processed in multipliers 27, 29 during first and second
cycles, the complex data for D2W2 is processed in multipliers
31, 33 during first and second cycles, and the complex data
for D3W3 is processed in multipliers 35, 37 during first and
second cycles. Ultimately, the processed complex outputs
(say, from first-column processing) appear at outputs 106, 107
for D0 in one cycle, and at outputs 108, 109 for D1 in the
one cycle, and appear at outputs 106, 107 for D2 in the
second cycle, and at outputs 108, 109 for D3 in the second
cycle. With the aid of cache memory 15 on-chip, these
outputs from first-column processing are stored to provide
the data inputs for second-column processing according to the
quasi radix-16, as previously described. This circuit also
supports radix-2 and radix-4 butterfly processing
operations. A radix-2 butterfly can be done in a single
cycle in this circuitry, but I/O only supplies half of the
required data in one cycle, which therefore limits the
radix-2 processing to 2 cycles. Radix-4 butterfly processing
can be executed in 2 cycles, but for the same reason, it
takes 4 cycles to read the data. In this circuitry, the QR16
butterfly is the only operation that achieves 100% execution
utilization, as shown in Figure 6. More specifically, the
QR16 according to the present invention is most efficient if
the FFT size is an integral power of 16, for example 256 (16
x 16), 4K (16 x 16 x 16), as illustrated in Table 1 above,
but is also capable of speeding up the computation by 30 to
50% for other FFT sizes. Assuming a 20 nanosecond I/O rate
and a 40 nanosecond radix-4 butterfly, the QR16 according to
the present invention performs a 1K FFT in 61.4 microseconds
compared with 102.4 microseconds if only radix-4 butterfly
passes were used on the same circuitry. Performance
improvements due to the QR16 according to the present
invention for other FFT sizes are also tabulated in Table 1,
above.
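
The 1K figures quoted above can be checked with a short calculation. The Python sketch below assumes a simple model drawn from the cycle counts stated in this description: one cycle equals the 20 nanosecond I/O period, an I/O-bound radix-2 butterfly occupies 2 cycles, an I/O-bound radix-4 butterfly occupies 4 cycles, and a QR16 butterfly occupies 16 cycles. The split of a 1K FFT into two QR16 passes plus one radix-4 pass, and the names CYCLES_PER_BUTTERFLY and fft_time_us, are assumptions for illustration.

```python
CYCLE_NS = 20  # I/O period, assumed equal to one execution cycle

# Cycles occupied per butterfly when limited by I/O (assumed model).
CYCLES_PER_BUTTERFLY = {2: 2, 4: 4, 16: 16}


def fft_time_us(n, passes):
    """Estimated time in microseconds for an n-point FFT performed as
    the given list of passes, e.g. [16, 16, 4] for 1K points."""
    total_ns = 0
    for radix in passes:
        butterflies = n // radix  # butterflies needed in one pass
        total_ns += butterflies * CYCLES_PER_BUTTERFLY[radix] * CYCLE_NS
    return total_ns / 1000.0


print(fft_time_us(1024, [16, 16, 4]))  # 61.44
print(fft_time_us(1024, [4] * 5))      # 102.4
```

Under this model every pass costs N cycles regardless of radix, so the saving for 1K points comes from needing three passes instead of five.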

Referring now to Figures 11-14, there is shown a
pictorial representation of the preferred memory storage
scheme for the coefficient, input, output and cache memories
9, 11, 13 and 15. Each of the memories 9, 11, 13 and 15
comprises a left bank of complex memories (A+jB) and a right
bank of complex memories (C+jD). The memories 9, 11, 13 and
15 preferably have 20 re addresses designated in the top row 
of each chart. Therefore, at each address one word may be
written in the left bank and another word may be written in the
right bank of complex memory. The charts of Figures 11-14
detail the write and read sequences for words written in the
respective memories. For example, as shown in Figure 11, the
write sequence for Radix 2, 4 or 16 (a) in the input data
memory 11 writes word 0 at the left bank (A+jB) of address 0,
word 1 at the right bank (C+jD) of address 0, and word 2 at
the left bank (A+jB) of address 1. Words 3-1F are similarly
written according to the write sequence indicated in Figure
11B. Exemplary schemes for reading and writing data in the
memories 9, 11, 13 and 15 are shown in remaining charts of
Figures 11-14. The data is preferably retrieved and stored
in the memories 9, 11, 13 and 15 with address sequences
ordered linearly or in some form of reverse addressing. As
noted above, the data in the Coefficient Memory 9 may be
ordered in a scrambled fashion to permit 100% use of the
execution unit 7 as described above with reference to Figure
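
The opening of the Figure 11 write sequence described above (word 0 in the left bank of address 0, word 1 in the right bank of address 0, word 2 in the left bank of address 1) is consistent with a simple linear mapping. The Python sketch below extends that mapping to every word, which is an assumption: the full sequences are defined by the figures, and the function name bank_and_address is hypothetical.

```python
def bank_and_address(word):
    """Map a word index to (address, bank) for an assumed linear write
    sequence: even words go to the left bank, odd words to the right."""
    address = word // 2
    bank = "left" if word % 2 == 0 else "right"
    return address, bank


for word in range(3):
    print(word, bank_and_address(word))
# 0 (0, 'left')
# 1 (0, 'right')
# 2 (1, 'left')
```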

CLAIMS



1. A processor for performing fast Fourier
transforms, said processor comprising:

an input memory having inputs and outputs for
storing data;

an output memory having inputs and outputs for
storing data;

a cache memory having inputs and outputs for
storing data;

a coefficient memory having inputs and outputs for
storing data; and

an execution unit having inputs and outputs for
performing logical and arithmetic operations on real and
imaginary numbers, the outputs of said execution unit coupled
to the inputs of the output memory and the cache memory, the
inputs of said execution unit coupled to the outputs of the
input memory, the outputs of the coefficient memory and the
outputs of the cache memory.

2. The processor of claim 1, wherein the input
memory, the output memory, the cache memory and the
coefficient memory are 32 words deep.

3. The processor of claim 1, wherein the cache
memory can supply and accept two words every clock cycle.

4. The processor of claim 1, wherein the input
memory and the coefficient memory have a read bandwidth twice
the write bandwidth, and the output memory has a write
bandwidth twice the read bandwidth.

5. The processor of claim 1, wherein the
execution unit further comprises:

a plurality of input registers having inputs and
outputs for storing data;

a plurality of multipliers having inputs and
outputs for performing cross product multiplication, the
inputs of said plurality of multipliers coupled to the
outputs of said registers;

a plurality of arithmetic logic units having
inputs and outputs for performing logical and arithmetic
operations on the data input, the inputs of said first stage
coupled to the outputs of the arithmetic logic units; and

a multiplexer having inputs and outputs for
formatting the data input, the inputs of said multiplexer
coupled to the outputs of the plurality of multiplexers.

6. The processor of claim 5, wherein the plurality of
arithmetic logic units comprises:

a first stage of arithmetic logic units having
inputs and outputs for generating the sum or difference of
data input, the inputs of said first stage coupled to the
outputs of the multipliers;

a second stage of arithmetic logic units having
inputs and outputs for generating sums and differences, the
inputs of said second stage coupled to the outputs of the
first stage; and

a third stage of arithmetic logic units having
inputs and outputs for performing logical and arithmetic
operations, the inputs of said third stage coupled to the
outputs of the second stage.

7. The processor of claim 1, wherein the input
memory, the output memory, the cache memory, the coefficient
memory and the execution unit are constructed on a single
chip.

8. The processor of claim 1, wherein the
execution unit is a radix-4 processor.

9. The processor of claim 1, wherein the
execution unit, input memory, output memory, cache memory and
coefficient memory are constructed on a single chip.

10. A method for performing fast Fourier
transforms on a quasi radix-16 processor having an execution
unit, input memory, output memory, coefficient memory and
cache memory, said method comprising the steps of:

storing data in the input memory and coefficient
memory;

reading data from the input memory and coefficient
memory;

performing calculations on the data from input
memory and coefficient memory with the execution unit;

storing the output of the execution unit in cache
memory;

reading data from cache memory;

performing calculations on the data from cache
memory, input memory and coefficient memory with the
execution unit; and